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Abstract. Repeat finding in strings has important applications in subfields such as computational 
biology. The challenge of finding the longest repeats covering particular string positions was recently 
proposed and solved by Ileri et al., using a total of the optimal 0{n) time and space, where n is the 
string size. However, their solution can only find the leftmost longest repeat for each of the n string 
position. It is also not known how to parallelize their solution. In this paper, we propose a new solution 
for longest repeat finding, which although is theoretically suboptimal in time but is conceptually simpler 
and works faster and uses less memory space in practice than the optimal solution. Further, our solution 
can find all longest repeats of every string position, while still maintaining a faster processing speed and 
less memory space usage. Moreover, our solution is parallelizable in the shared memory architecture 
(SMA), enabling it to take advantage of the modern multi-processor computing platforms such as 
the general-purpose graphics processing units (GPU). We have implemented both the sequential and 
parallel versions of our solution. Experiments with both biological and non-biological data show that 
our sequential and parallel solutions are faster than the optimal solution by a factor of 2-3.5 and 6-14, 
respectively, and use less memory space. 
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1 Introduction 

Repetitive structures and regularities finding in genomes and proteins is important as these structures play 
important roles in the biological functions of genomes and proteins [7,21,20,19,14,4,1,13]. It is well known 
that overall about one-third of the whole human genome consists of repeated subsequences [17]; about 10- 
25% of all known proteins have some form of repetitive structures [14]. In addition, a number of significant 
problems in molecular sequence analysis can be reduced to repeat finding [16]. Another motivation for finding 
repeats is to compress the DNA sequences, which is known as one of the most challenging tasks in the data 
compression field. DNA sequences consist only of symbols from {ACGT} and therefore can be represented by 
two bits per character. Standard compressors such as gzip and bzip usually use more than two bits per 
character and therefore cannot achieve good compression. Many modern genomic sequence data compression 
techniques highly rely on the repeat finding in sequences [15, 2]. 

The notion of maximal repeat and super maximal repeat [7,1,12,3] captures all the repeats of the whole 
string in a space-efficient manner, but it does not track the locality of each repeat and thus can not support 
the finding of repeats that cover a particular string position. For this reason, Ileri et al. [8] proposed the 
challenge of longest repeat query, which is to find the longest repetitive substring(s) that covers a particular 
string position. Because any substring of a repetitive substring is also repetitive, the solution to longest 
repeat query effectively provides an effective “stabbing” tool for finding the majority of the repeats covering 
a string position. Ileri et al. proposed an 0(n) time and space algorithm that can find the leftmost longest 
repeat of every string position. Since one has to spend 17 (n) time and space to read and store the input 
string, the solution of Ileri et al. is optimal. 
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Our contribution. In this paper, we propose a new solution for longest repeat query. Although our solution is 
theoretically suboptimal in the time cost, it is conceptually simpler and runs faster and uses less memory space 
than the optimal solution in practice. Our solution can also find all longest repeats for every string position 
while still maintaining a faster processing speed and less space usage, whereas the optimal solution can only 
find the leftmost candidate. Further, our solution can be parallelized in the shared-memory architecture, 
enabling it to take advantage of the modern multi-processor computing platforms such as the general-purpose 
graphics processing units (GPU) [5,9]. We have implemented both the sequential and parallel versions of our 
solution. Experiments with both biological and non-biological data show that our solution run faster than 
the 0{n) optimal solution by a factor of 2-3.5 using CPU and 6-14 using GPU, and use less space in both 
settings. 

Road map. After formulating the problem of longest query in Section 2, we prepare some technical background 
and observations in Section 3 for our solutions. Section 4 presents the sequential version of our solutions. 
Following the interpretation in Section 4, it is natural and easy to get the parallel version of our solution, 
which is presented in Section 5. Section 6 shows the experimental results on the comparison between our 
solutions and the 0{n) solution using real-world data. 

2 Problem Formulation 

We consider a string S')! ... n], where each character S[i] is drawn from an alphabet S = {1,2,..., a}. A 
substring S[i... j] of S represents S[i]S[f -I- 1]... S[j] if 1 < i < j < n, and is an empty string if i > j. 
String S[i'... f] is a proper substring of another string S[i... j] if i < i' < f < j and j' — i' < j — i. 
The length of a non-empty substring S[U .. j], denoted as |S[i... j]|, is j — i -I- 1. We define the length of an 
empty string as zero. A prefix of S is a substring S[1 .. .i] for some i, 1 < i < n. A proper prefix S[1 .. .i] 
is a prefix of S where i < n. A suffix of S is a substring S'[i... n] for some i, 1 < i < n. A proper suffix 
S'[i... n] is a suffix of S where i > 1. We say the character S'[i] occupies the string position i. We say the 
substring S'[i .. .j] covers the fcth position of A, if i < fc < j. For two strings A and B, we write A = B (and 
say A is equal to B), if |A| = \B\ and A[i] = B[i] for i = 1, 2,..., |A|. We say A is lexicographically smaller 
than B, denoted as A < B, if (1) A is a proper prefix of B, or (2) A[l] < B[l\, or (3) there exists an integer 
fc > 1 such that A[i\ = B[i\ for all 1 < i < /c — 1 but A[k\ < B[k\. A substring ... j] of A is unique, if 
there does not exist another substring S[i'.. .j'] of S, such that S[i ... j] = S'[i'.../] but i ^ i'. A substring 
is a repeat if it is not unique. A character S'[i] is a singleton, if it appears only once in S. 

Definition 1. For a particular string position k € {1, 2,..., n}, the longest repeat (LR) covering po¬ 
sition k, denoted as LRk, is a repeat substring S'[i.. .j], such that: (1) i < k < j, and (2) there does not 
exist another repeat substring S'[i'.. .j'], such that i' < k < j' and j' — i' > j — i. 

Obviously, for any string position k, if S[k] is not a singleton, LR^ must exist, because at least S[k] itself 
is a repeat. Further, there might be multiple choices for LRj^. For example, if A = abcabcddbca, then LR 2 
can be either S')! ... 3] = abc or S)2 .. .4] = bca. In this paper, we study the problem of finding the longest 
repeats of every string position of S. 

Problem (longest repeat query): For every string position k € {1, 2,..., n}, we want to find LRk or the fact 
that it does not exist. If multiple choices for LRk exist, we want to find all of them. 

3 Preliminary 

The suffix array SA[1... n] of the string S is a permutation of {1,2,..., n}, such that for any i and j, 
1 < i < j < n, we have S)SA)i]... n] < S)SA(j]... n]. That is, SA)i] is the starting position of the ith suffix 
in the sorted order of all the suffixes of S. The rank array Rank[l.. .n] is the inverse of the suffix array. 
That is, Rank[i] = j iff SA[j] = i. The longest common prefix (Icp) array LCP\1 ... n -I- 1] is an array of 
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of an example string S = mississippi. 


Table 2: The number of walk steps in the LR’s calculation 
using our solutions for several 50MB files from Pizza&Chili ^ 


n+1 integers, such that for z = 2,3,..., n, LCP[i] is the length of the Icp of the two suffixes 5'[5^[z — 1]... n] 
and S'[S'^[z].. .n]. We set LCP[1] = LCP[n + 1] = 0. In the literature, the Icp array is often defined as 
an array of n integers. We include an extra zero at LCP[n + 1] is only to simplify the description of our 
upcoming algorithms. Table 1 shows the suffix array and the Icp array of the example string mississippi. 

Definition 2. For a particular string position k € {1,2,, n}, the left-bounded longest repeat (LLR) 
starting at position k, denoted as LLRk, is a repeat S'[fc .. .j], such that either j = n or S[k ... j + 1] is 
unique. 

Clearly, for any string position k, if S[k] is not a singleton, LLRk must exist, because at least S[k] itself 
is a repeat. Further, if LLRk does exist, there must be only one choice, because k is a fixed string position 
and the length of LLRk must be as long as possible. Lemma 1 shows that, given the rank and Icp arrays of 
the string S, we can directly calculate any LLRk or find the fact of its nonexistence. 

Lemma 1. For i = 1,2,... ,n: 


^ I S[i...i + Li-1] , if Li>0 

* { does not exist , if Li = 0 

where Li = mayi{LCP[Rank[i\[, LCP[Rank[i\ + 1]}. 

Proof. Note that Li is the length of the Icp between the suffix S'[z ... n] and any other suffix of S. If Li > 0, 
it means substring S'[z .. .i + Li — 1] is the Icp among 5'[z...n] and any other suffix of S. So S[i.. .i + Li — 1] 
is LLRi. Otherwise {Li = 0), the letter S'[z] is a singleton, so LLRi does not exist. □ 

Clearly, the left-ends of LLRi, LLR 2 ,. ■., LLRn strictly increase as 1, 2,..., n. The next lemma shows the 
right-ends of LLR’s also monotonically increase. 

Lemma 2. | LLRi \ < \ LLRi+i \ + 1, for every i = 1,2,... ,n — 1. 

Proof. The claim is obviously correct for the cases when LLRi does not exist (| LLRi | = 0) or | LLRi 1 = 1) 
so we only consider the case when | LLRi \ > 2. Suppose LLRi = S'[z.. .j], i < j. It follows that z -I- 1 < j. 
Since S'[z... j] is a repeat, its substring S'[z -|- 1... j] is also a repeat. Note that LLRi+i is the longest repeat 

substring starting from position z -|- 1, so | | > |S'[z -I- 1... j]| = | LLRi I — 1- C 

^ http://pizzachili.dec.uchile.cl/texts.html 


3 



Algorithm 1: Sequential finding of the leftmost LR^, k = 1,... ,n, using the raw LLR array. 


Input: The rank array and the Icp array of the string S 


/* Calculate LLRi, LLR2, ■ ■ ■, LLRn ■ 

1 for i = 1,2,... ,n do LLRr[i] <— maD<.{LCP[Rank[i]], LCP[Rank[i] + 1]}; 


*/ 

// Length of LLRi 
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/* Calculate LR\, LR 2 , ■ ■ ■, LRn . 
for k = 1,2,... ,n do 
LR ^ (-1,0) ; 
for i = k down to 1 do 

if i + LLRr\%\ — 1 < k then 
break; 


7 
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else if LLRr\i\ > LR .length then 
\_ LR^ {i,LLRr[i])-, 
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print LR-, 


*/ 

11 {start, end) ■. start and ending position of LRk- 

II LLRi does not exist or does not cover k. 

II Early stop 


Lemma 3. Every LR is an LLR. 

Proof. Assume that LRk = S'[i.. .j] is not an LLR. Note that S'[i... j] is a repeat starting from position i. 
If ... j] is not an LLR, it means S'[z... j] can be extend to some position / > j, so that 5'[z.../] is still 
a repeat and also covers position k. That says, |5'[z . ■ .j'\\ > |S'[i... j]|. However, the contradiction is that 
S'[i... j] is already the longest repeat covering position k. □ 

4 Two Simple and Parallelizable Sequential Algorithms 

We know every LR is an LLR (Lemma 3), so the calculation of a particular LRk is actually a search for the 
longest one among all LLR’s that cover position k. Our discussion starts with the finding of the leftmost LR 
for every position. In the end, an trivial extension will be made to find all LR’s for every string position. 


4.1 Use the raw LLR array 

We first calculate LLRi, for i = 1, 2,...,n, using Lemma 1, and save the result in an array LLRr[l.. .n], 
where each LLRr[i] = \ LLRi \- We call LLRr[l .. .n] the raw LLR array. Because the rightmost LLR that 
covers position k is LLRk and the right boundaries of all LLR’s monotonically increase (Lemma 2), the 
search for LRk becomes simply a walk from LLRk toward the left. The walk will stop when it sees an LLR 
that does not cover position k or it has reached the left end of the LLRr array. During this walk, we will 
record the longest LLR that covers position k. Ties can be broken by storing the leftmost such LLR. This 
yields the simple Algorithm 1, which outputs every LR as a {start, length) tuple, representing the starting 
position and the length of the LR. 

Lemma 4. Given the rank and lap arrays, Algorithm 1 can find the leftmost LRk for every k = 1,2,... ,n, 
using a total of 0(n) space and 0{an) time, where a is the average number of LLR’s that cover a string 
position. 

Proof. (1) The time cost for the LLRr array calculation is obviously 0(n). The algorithm finds the LR 
of each of the n string positions. The average time cost for each LR calculation is bounded by the average 
number of walk steps, which is equal to the average number of LLR’s that cover a string position. Altogether, 
the time cost is 0{an). (2) The main memory space is used by the rank, Icp, and LLRr arrays, each of which 
has n integers. So altogether the space cost is 0(n) words. □ 

Theorem 1. We can find the leftmost LRk for every k = 1,2,... ,n, using a total of 0(n) space and 0{an) 
time, where a is the average number of LLR’s that cover a string position. 
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Proof. The sufhx array of S can be constructed using existing 0(n)-time and space algorithms (For example, 
[11]). After the suffix array is constructed, the rank array can be trivially created using another 0(n) time 
and space. We can then use the suffix array and the rank array to construct the Icp array using another 
0(n) time and space [10]. Combining with Lemma 4, the claim in the theorem is proved. □ 

Extension: find all LR’s for every string position. As we have demonstrated in the example after Definition 1, 
a particular string position may be covered by multiple LR’s, but Algorithm 1 can only find the leftmost one. 
However, extending it to find all LR’s for every string position is trivial: During each walk, we simply report 
all the longest LLR’s that cover the string position, of which we are computing the LR. In order to do so, we 
will need to do the same walk twice. The first walk is to find the length of the LR and the second walk will 
actually report all the LR’s. We give the pseudocode of this procedure in Algorithm 3 in the appendix. This 
algorithm certainly has another extra 0{a) time cost on average for each string position’s LR calculation 
due to the extra walk, but it still gives a total of 0{an) time cost and 0{n) space cost. 

Corollary 1. We can find all LR’s covering every position k = 1, 2,..., n, using a total of 0(n) space and 
0{an) time, where a is the average number of LLR’s that cover a string position. 

Comment: (1) Algorithm 1 and 3 are cache friendly. Observe that the finding of every LR essentially is a 
linear walk over a continuous chunk of the LLRr array. For real-world data, the number of steps in every 
such walk is quite limited, as shown in the upper rows of Table 2. Note that the English dataset gives a 
much higher average number of walk steps, because the data was synthesized by appending several real-world 
English texts together, making many paragraphs appear several times. Because of the few walk steps needed 
for real-world data, the walking procedure can thus be well cached in the L2 cache, whose size is around 
several MBs in most nowadays desktops’ CPU architecture, making our algorithm much faster in practice. 
Note that the optimal 0{n) algorithm [8] uses a 2-table system to achieve its optimality, which however 
has quite a pattern of random accessing the different array locations during its run and thus is not cache 
friendly. We will demonstrate the comparison with more details in Section 6. (2) Algorithm 1 and 3 are 
parallelizable in shared-memory architecture. First, each LLR can be calculated independently by a separate 
thread. After all LLR’s are calculated, each LR can also be calculated independently by a separate thread 
going through an independent walk. This enables us to implement this algorithm on GPU, which supports 
massively parallel threads using data parallelism. 


4.2 Use the Compact LLR Array 

Observe that an LLR can be a substring (suffix, more precisely) of another LLR. For example, suppose 
S = ababab, then LLR4 = S'[4... 6] = bab, which is a substring of LLR^ = S'[3... 6] = abab. We know 
every LR must be an LLR (Lemma 3). So, if an LLRi is a substring of another LLRj, LLRi can never be 
the LR of any string position, because every position covered by LLRi is also covered by at least another 
longer LLR, LLRj. 

Definition 3. We say an LLR is useless if it is a substring of another LLR; otherwise, it is useful. 

Recall that in Algorithm 1 and 3, the calculation of a particular LRi is a search for the longest one among 
all LLR’s that cover position i. This search procedure is simply a walk from LLRi toward the left until it 
sees an LLR that does not cover position i or reaches the left end of the LLRr array. This search can be 
potentially sped up, if we have had all useless LLR’s eliminated before any search is performed. We will use 
a new array LLRc, called the compact LLR array, to store all the useful LLR’s in the ascending order of 
their left ends (as well as of their right ends, automatically). 

By Lemma 2, we know if LLRi-i is not empty, the right boundary of LLRi is on or after the right 
boundary of LLRi-i, for any i > 2. So, we can construct the LLRc array in one pass as follows. We will 
calculate every LLRi using Lemma 1, for i = 1, 2,..., n, and will eliminate every LLRi if | LLRi | = 0 or 
I LLRi I = I LLRi-i I — 1. Because of the elimination of the useless LLR’s, we will have to save each LLR as 
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(a) The raw LLR array 

Fig. 1: The geometric perspective of an example raw 
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a (start, length) tuple, representing the starting position and the length of the LLR, in the LLRc array. 
Figure 1 shows the geometric perspective of the elements in an example LLRr array and its corresponding 
LLRc array, where every LLR is represented by a line segment whose start and ending position represent 
the start and ending position of the LLR. 

Note that, in the LLRc array, any two LLR’s share neither the same left-end point (obviously) nor the 
same right-end point. In other words, the left-end points of all useful LLR’s strictly increase, and so do their 
right-end points, i.e., all the elements in the LLRc array have been sorted in the strict increasing order of 
their left-end (as well as right-end) points. See Figure lb for an example. Therefore, given a string position, 
we will be able to find the leftmost useful LLR that covers that position using a binary search over the LLRc 
array and the time cost for such a binary search is bounded by O(logn). After that, we will simply walk 
along the LLRc array, starting from the LLR returned by the binary search and toward the right. The walk 
will stop when it sees an LLR that does not cover the string position or it has reached the right end of the 
LLRc array. During the walk, we will just report the longest LLR that covers the given string position. Ties 
are broken by picking the leftmost such longest LLR. This leads to the Algorithm 2. 

Lemma 5. Given the rank and lap arrays, Algorithm 2 can find the leftmost LRk for every k = 1,2,... ,n, 
using a total of 0{n) space and 0{n(logn + P)) time, where {3 is the average number of useful LLR’s that 
cover a string position. 

Proof. (1) The time cost for the LLRc array calculation is obviously 0{n) time. The algorithm finds the LR 
of each of the n string positions. The average time cost for the calculation of the LR of one position includes 
the O(logn) time for the binary search and the time cost for the subsequent walk, which is bounded by the 
average number of useful LLR’s that cover a string position. Altogether, the time cost is 0(n(logn -I- fi)). 
(2) The main memory space is used by the rank, Icp, and LLRc arrays. Each of the rank and Icp arrays has 
n integers. The LLRc array has no more than n pairs of integers. Altogether, the space cost 0{n) words. □ 

Theorem 2. We can find the leftmost LRk for every k = 1,2,... ,n, using a total of 0{n) space and 
0{n{\ogn + j3)) time, where /3 is the average number of useful LLR’s that cover a string position. 

Proof. The suffix array of S can be constructed by existing algorithms using 0{n) time and space (For 
example, [11]). After the suffix array is constructed, the rank array can be trivially created using another 
0{n) time and space. We can then use the suffix array and the rank array to construct the Icp array using 
another 0(n) time and space [10]. Combining the results in Lemma 5, the theorem is proved. □ 

Extension: find all LR’s for every string position. Algorithm 2 can also be trivially extended to find all LR’s 
for every string position by simply reporting all the longest LLR that covers the position during every walk. 
In order to do so, we will need to walk twice for each string position. The first walk is to get the length of 
the LR and the second walk will report all the actual LR’s. We give the pseudocode of this procedure in 
Algorithm 4 in the appendix. This algorithm certainly has another extra 0(/3) time cost on average for each 
LR’s calculation due to the extra walk, but still gives a total of 0{n{logn + (3)) time cost and 0{n) space 
cost. 
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Algorithm 2: Sequential finding of the leftmost LR^, k = 1, 


Input: The rank array and the Icp array of the string S 
/* Calculate the compact LLR array. 

1 i <— 1; prev •«— 0; 

2 for i = 1,2,... ,n do 

3 L <— mauK{LCP[Rank[i]], LCP[Rank[i] + 1]}; 

4 if 1/ > 0 and L > prev then LLRc[j] {i, T); 1 A— i + 1; 

5 prev L: 


, n, using the LLRc array. 


// Length of LLRi 


6 sizej — 1 ; // Size of the LLRc array. 

/* Calculate LRi, LR 2 ,..., LR^ . */ 

7 for fc = 1, 2,..., n do 

8 LR <r- ( — 1,0) ; // {start, end) : start and ending position of LRk. 

9 start BinarySearch(iL7?c, fc)/* Return the smallest index of the LLRc array element that 

covers position k, if such element exists; otherwise, return —1. */ 

10 if start yf — 1 then 

11 for i = start... size do 

12 if LLRc[i].start + LLRc[i].length — 1 < k then // LLRc[i] does not cover k. 

13 L break; // Early stop 

14 else if LLRc[i].length > LR .length then 

15 LR ■«— LLRc[i\, 


print LR-, 


Corollary 2. We can find all LR’s of every string position k = 1,2,... ,n, using a total of 0(n) spaee and 
0{n{\ogn + (3)) time, where /3 is the average number of useful LLR’s that cover a string position. 

Comment: (1) The binary searches that are involved in Algorithms 2 and 4 are not cache friendly. However, 
compared with Algorithms 1 and 3, Algorithms 2 and 4 on average have much fewer steps (the (3 value) 
in each walk due to the elimination of the useless LLR’s (see the bottom rows of Table 2). This makes 
Algorithms 2 and 4 much better choices rather than Algorithms 1 and 3 for run environments that have 
small cache size. Such run environments include the GPU architecture, where the cache size for each thread 
block is only several KBs. We will demonstrate this claim with more details in Section 6. (2) With more 
care in the design. Algorithms 2 and 4 are also parallelizable in shared-memory architecture (SMA), which 
is described in the next Section. 

5 Parallel Implementation on GPU 

In this section, we describe the GPU version of Algorithms 1 and 2 and their extensions (Algorithms 3 and 4). 
After we construct the SA, Rank, and LGP arrays on the host GPU we transfer the Rank array and the 
LCP array to the GPU device memory. We start with the calculation of the raw LLR array in parallel. 

Compute the raw LLR array. After the LCP and Rank arrays are loaded into GPU memory, we launch a 
CUBA kernel to compute the raw LLR array on GPU device using massively parallel threads, as illustrated 
in Figure 2. Each thread ti on the device computes a separate element LLRr[i] = | LLRi \ using the following 
equation from Lemma 1. 

LLR[i\ = Tiiax.{LCP[Rank[i\, LCP[Rank[i\ -1-1]} 

^ The SA, Rank, and LCP arrays can also be constructed in parallel on GPU [18, 6], but due to the unavailability of 
the source code or executables from the authors of [18,6], we choose to construct these arrays on the host CPU, 
without affecting the demonstration of the performance gains by our algorithms. 
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SA, Rank, LCP array construction on CPU 


Parallelized on GPU 


I Compute: LLRr[i] — inax{LC'P[fiaTiA:[i]], LCP[Ra'nk[i] + 1]}, for i = 1,2,,., , 71 "! 



Fig. 2: Overview of the GPU Implementation 


t 2 ts ti ts te tj ts GPU threads 
I l| 0| 0| l| l| 0| 0| l| Flag array 

|3|2|l|l|3|2|l|l| raw LLR array 



I <1,3> I <4,1> I <5,3> I <8,1> I compact LLR array 


I l| l| l| 2 \ 3| 3| 3| 4 \ Prefix_Sum array 

Fig. 3: LLR compaction on GPU 


Since each LLRi must start with string position i, we only need to save the length of each LLRi in LLRr[i]. 
After creating the raw LLR array, we have two options, which in turn lead to two different parallel solutions: 
using the raw LLR array or the compact LLR array. 

5.1 Compute LR’s using the raw LLR array 

The parallel implementation of Algorithm 1 using the raw LLR array is straightforward, as presented by 
the left branch of Figure 2. With the raw LLR array returned from the previous kernel launch on the GPU 
device, we launch a second kernel for LR calculation. Each CUBA thread ti on the device is to find LRi by 
performing a linear walk in the LLRr array, starting at LLRr[i] toward the left. The walk continues until it 
finds an LLRr array element that does not cover position i or has reached the left end of the LLRr array. The 
leftmost or all LRi can be reported during the walk, as discussed in Algorithm 1. Note that in this search, 
each GUDA thread checks a chunk of contiguous elements in the LLRr array and this can be cache-efficient. 

Taking the calculation of LRiq using the raw LLR array shown in Figure la as an example. The corre¬ 
sponding GUDA thread tio searches a contiguous chunk of the LLRr array starting from index 10 down to 
left in the LLRr array. We do not search the LLRr elements that are to the right of index 10, because these 
elements definitely do not cover position 10 according to the definition of LLR. In particularly, thread Go 
goes through LLRr[10], LLi?r[9], LLRr\S\ and LLRr\J] to find the longest one among the four of them as 
LRiq. Thread Uo stops the search at LLRr position 6 , because LLRr[&\ and all LLR’s to its left do not cover 
position 10 (Lemma 2). 

5.2 Compute LR’s using the compact LLR array 

LLR Compaction. The right branch of Figure 2 shows the second option in computing LR’s on GPU. That 
is to use the compact LLR array. We first create the compact LLR array, named as LLRc, from the raw LLR 
array, which has been created and preserved on the device memory. To avoid the expensive data transfer 
between the host and the device and to achieve more parallelism, we perform the LLR array compaction 
on the GPU device in parallel. We launch three GUDA kernels to perform the compaction, denoted as /Ci, 
/C 2 , and /C 3 . As shown in Figure 3, after the LLRr array is constructed on the device, we first launch kernel 
/Cl to compute a flag array Flag[l.. .vi] in parallel, where the value of each element Flag[i] is assigned 
by a separate thread U as follows: (1) Flag[l] = 1 iff LLRr[l] > 0. (2) Flag[i] = 1, iff LLRr[i\ > 0 and 
LLRr[i\ > LLRr[i — 1], for i = 2, 3,..., n. Flag[i] = 0 means LLRi is useless and thus can be eliminated. 

After the Flag array is constructed from kernel /Ci, we launch kernel IC 2 to calculate the prefix sum of 
the Flag array on the device: Prefix-Sum[i] = We modify the prefix sum function provided 

by the GUDA toolkit for this purpose. 

With the prefix sum array and the Flag array, we launch kernel /C 3 to copy the useful LLRr array elements 
into the LLRc array, as illustrated in Figure 3. Each thread U on the device moves in parallel the LLRi to 
an unique destination LLRc[Prefix-Sum[i]], if Flag[i] = 1. That is, LLRc[Prefix-Sum[i]] = {i, LLRr[i]), if 
Flag[i] = 1. Each element in the LLRc array is a useful LLR and is represented by a tuple of {start, length), 
the start and ending position of the LLR. 







Compute LR’s. After the LLRc array is prepared, we calculate the LR for every string position in parallel. 
Recall that the calculation of each LRk, for each A: = 1, 2,..., n, is a search for the longest useful LLR that 
covers position k. We also know all these relevant LLR’s that we need to search comprise a continuous chunk 
of the LLRc array. The start position of the chunk can be found using a binary search as we have explained 
in the discussion of Algorithm 2. After that, a simple linear walk toward the right is performed. The walk 
continues until it finds an LLRc array element that does not cover position k or has reached the right end 
of the LLRc array. 

To compute the LR’s using the LLRc array, we launch another CUBA kernel, in which each CUBA 
thread tk first performs a binary search to find the start position of the linear walk and then walk through 
the relevant LLRc array elements to find either all LR’s or a single LR covering position k. 

Referring to Figure lb, we take the LR calculation covering the string position 9 as an example. Recall 
that we have discarded all useless LLR’s in the LLRc array, so the LLRc array element at index 9 is not 
necessarily the rightmost LLR that cover string position 9. Therefore, we have to perform a binary search 
to locate that leftmost LLRc array element by taking advantage of the nice property of the LLRc array that 
both the start and ending positions of all LLR’s in it are strictly increasing. After thread tg locates the LLRc 
element LLi?c[4], the leftmost useful LLR that covers the string position 9, it performs a linear walk toward 
the right. The walk will continue until it meets LLRc[6], which does not cover position 9. Thread tg will 
return the longest ones among LLRc [4 ... 6] as LRg. 


5.3 Advantages and Disadvantages: LLRr vs. LLRc 

When the raw LLR array is used, the algorithm is straightforward and easy to implement, because there 
is no needs to perform the LLR compaction on the device. However, with a raw LLR array, we could have 
a large number of useless LLR’s in the raw LLR array, especially when the average length of the longest 
repeats is quite large. For that reason, the subsequent linear walk for each CUBA thread can take many 
steps, making the overall search performance worse. 

In contrast, under a compact LLR array, we have to perform the LLR compaction, which involves data 
coping and requires extra memory usage for the Flag and the prefix sum array on the device. In addition, 
a binary search, which is not present with a raw LLR array, is required to locate the first LLR for the 
linear walk. The advantage of a compact LLR array is that we remove the useless LLR’s and dramatically 
shorten the linear walk distance.We provide more analysis and comparison between these two solutions in 
the experiment section. 


6 Experimental Study 

Experiment Environment Setup. We conducted our experiments on a computer running GNU/Linux with 
a kernel version 3.2.5I-I. The computer is equipped with an Intel Xeon 2.40GHz E5-2609 CPU with 10MB 
Smart Cache and has 16GB RAM. We used a GeForce GTX 660 Ti GPU for our parallel tests. The GPU 
consists of 1344 CUBA cores and 2GB of RAM memory. The GPU is connected with the host computer 
with a PCI Express 3.0 interface. We install CUBA toolkit 5.5 on the host computer. We use C to implement 
our sequential algorithms and use CUBA C to implement our parallel solutions on the GPU, using gcc 4.7.2 
with -03 option and nvcc V5.5. 0 as the compilers. We test our algorithms on real-world datasets including 
biological and non-biological data downloaded from the Pizza&Ghili Corpus. The datasets we used are the 
three 50MB DNA, English, and Protein pure ASCII text files, each of which thus represents a string of 
50 X 1024 X 1024 characters. 

Measurements. We measured the average time cost of three runs of our program. In order to better highlight 
the comparison of the algorithmics between the old and our new solutions, we did not include the time 
cost for the I/O operations that save the results. For the same purpose, we also did not include the time 
cost for the SA, Rank, and LCP array constructions, because in both the old and our new solutions, these 
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auxiliary data structures are constructed based on the same best suffix array construction code available on 
the Internet Our source code for this work is also available on website.^ 


6.1 Time 

In the top three charts of Figure 4, using three datasets, we compare different algorithms that return only 
the leftmost LR for every string position of the input data. In the bottom three charts, we present the 
performance of our algorithms that are able to find all LR’s for every string position. We compare onr new 
algorithms with the existing optimal sequential algorithm [8], which can only find the leftmost LR for every 
string position. Table 3 summarizes the speedup of our algorithms against the old optimal algorithm. From 
experiments, we are able to make the following observations. 

Sequential algorithms on CPU. Our new sequential algorithm using the raw LLR is consistently faster than 
the old optimal algorithm by a factor of 1.97-3.44, while our new sequential algorithm that uses the compact 
LLR array is consistently slower. This observation is true in both finding the leftmost LR and all LR’s. 
(Please note that the old optimal algorithm always finds the leftmost LR only.) 

On the host CPU, three dominating factors contribute to the better performance of algorithms using a 
raw LLR array rather than using a compact LLR array. First, although the compact LLR array can still be 
constructed in one pass, but the construction involves a lot more computational steps than those needed in 
the construction of the raw LLR array. Second, sequential algorithms that use a compact LLR array require 
a binary search in order to locate the starting position of the subsequent linear walk in the calculation of 
every LR. However, binary searches are not required if we work with a raw LLR array. As it is known, 
binary search over a large array is not cache friendly. Through profiling, we observe that the binary search 
operations consume from 63% to 73% of the total execution time. Third, even though for some datasets the 
search range size (or the number of walk steps) with a raw LLR array could be 10, 000 times larger than that 
using a compact LLR array, as shown in table 2, the L2 cache (10MB) of the host CPU is large enough to 
cache the range of contiguous LLR’s that each linear walk needs to go through. Such efficient data caching 
helps all walks take less than a total of 100 milliseconds on the host CPU, accounting for less than 5% of the 
total execution time, even with the raw LLR array. In other words, given a large cache memory, the number 
of walk steps is no longer a dominating factor in the overall performance. 

Parallel algorithms on GPU. Our new parallel algorithm on GPU using the compact LLR array is consistently 
faster than its counterpart that uses the raw LLR array, which is consistently faster than the old optimal 
algorithm by a factor of 8.32-14.62 in finding the leftmost LR and 6.36-10.35 in finding all LR’s. 

Unlike the sequential algorithm on the host CPU, the performance of the parallel algorithm on the GPU 
device is dominated by the number of LLR’s (the number of walk steps) that each walk will go through. As 
we profile our GPU implementation, we observe that with the raw LLR array, all linear walks on the GPU 
take roughly a total of eight seconds for the English dataset. But, the walks take roughly 70 milliseconds 
only if using a compact LLR array on the GPU. This is because: (1) the small GPU L2 cache (384KB shared 
by all streaming multiprocessors) cannot host as many LLR’s as what the CPU L2 cache (10MB) can host, 
resulting in more cache-read misses and more expensive global memory accesses. (2) The number of walk 
steps with a compact LLR array is less than that with a raw LLR array by a factor of up to four orders of 
magnitude (see Table 2). (3) The extra time cost for the LLR compaction that is needed when using the 
compact LLR array become much less significant in the total execution time on GPU. On the host CPU, 
our sequential solution takes roughly 1.3 seconds to perform the LLR compaction for datasets of 50MB 
and accounts for 20% of the total time cost on average. However, it takes less than 30 milliseconds on the 
GPU, accounting for only 9.5% of the entire time cost. We achieve more than 40 times speedup in the LLR 
compaction by utilizing GPU device. 

® http://code.google.com/p/libdivsufsort/ 

^ http://penguin.ewu.edu/~bojianxu/publications 
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Fig. 4: Time Cost vs Dataset Size. The three charts on the top and bottom show the experimental results 
on hnding the leftmost and all repeats of every string position, respectively. 



Sequential 
No Compact 
Leftmost 

Sequential 
No Compact 
All 

Parallel 

Compact 

Leftmost 

Parallel 

Compact 

All 

DNA 

2.91x 

2.91x 

13.48x 

9.43x 

English 

1.97x 

1.97x 

8.32x 

6.36x 

Protein 

3.44x 

3.44x 

14.62x 

10.35X 


Table 3: Speedup with 50MB Files 



Old (MBs) 

Onrs (MBs) 

Space Saving 

DNA 

792.77 

650.39 

17.96% 

English 

654.02 

650.39 

0.56% 

Protein 

773.53 

650.39 

15.92% 


Table 4: RAM Usage Comparison for 50MB Files 


The first two reasons above are reassured by the experimental results regarding the English dataset, 
which we purposely chose to use. The English file is synthesized by simply concatenating several English 
texts, and thus the text has many repeated paragraphs, which in turn creates many useless LLR’s in the 
data. In this case, with the raw LLR array, each walk will have a large number of steps due to such useless 
LLR’s. However, after we compact the raw LLR array, the number of walk steps can be signihcantly reduced 
(Table 2) and consequently the GPU code’s performance is significantly improved (Figure 4). 

6.2 Space 

Table 4 shows the peak memory usage of both the old and our new algorithms for datasets of size 50MBs. 
The memory usage of all of our algorithms is the same. This is because the space usage by the SA, Rank, 
and LCP array dominate the peak memory usage of all of our algorithms. On the other hand, due to its 
2 -table system that helps achieve the theoretical 0(n) time complexity, the old optimal algorithm’s space 
usage is relevant to the dataset type and is higher than ours. 

6.3 Scalability 

Although our algorithms have a superlinear time complexity in theory, but they all scale well in practice as 
shown by Figure 4. As we increase the size of the test data, we observe a consistent speedup. In addition, we 
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did conduct experiments on datasets of 100MB on the GPU device by using a 2D grid of CUBA threads in 
order to create more than 100 million threads on the device. When finding the leftmost LR for each string 
position, we observed the same speedups as shown in Figure 4. 

On the host CPU, the large cache size dramatically reduces the total number of memory reads during 
the linear walk in a raw LLR array and thus enables us to eliminate the expensive binary search operations 
by using a raw LLR array. On the GPU device, although all data is stored in the global memory, a compact 
LLR array helps greatly reduce the total number of global memory access; each thread linearly searches a 
smaller number of LLR’s. As shown in Table 2, the average number of walk steps in a compact LLR array 
is no more than six, which enables the linear walk to be considered as a constant-time operation. 

7 Conclusion and Future Work 

We proposed conceptually simple and thus easy-to-implement solutions for longest repeat finding over a 
string. Our algorithm although is not optimal in time theoretically, but runs faster than the old optimal 
algorithm and uses less space. Further, our algorithm can find all longest repeats of every string position, 
whereas the old optimal solution can only find the leftmost one. Our algorithm can be parallelized in shared- 
memory architecture and has been implemented on GPU using the data parallelism to gain further speedup. 

Our GPU solution is roughly 4.5 times quicker than our best sequential solution on the GPU, and up 
to 14.6 times quicker than the old optimal solution on the CPU. Also, we improve the LLR compaction 
performance by a factor of 40 on GPU. The multiprocessors in our current GPU have a built-in LI and L2 
cache, which help coalesce some global memory accesses. In the future, we will further optimize our parallel 
solution by utilizing the GPU shared memory or texture memory to further reduce global memory access. 
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Appendix 


Algorithm 3: Sequential finding of all LRk for k = 1,2,... ,n, using the raw LLR array. 


Input: The rank array and the Icp array of the string S 


/* Calculate LLRi, LLR 2 t ■ ■ . , LLRn ■ 

1 for i — 1, 2, . . . , n do LLRr[i] max-[LCP[Rank[i]], LCP[Rank[i] + 1]}; 


2 

3 

4 

5 

6 

r 

8 

9 

10 

11 

12 

13 


/* Calculate LR\^ LR 2 , ■ . . , LRn . 
for k — 1 , 2 , . . . , n do 

/* Calculate the length of LR^. 

length ■<— 0 ; 

for i — k down to 1 do 

if i + LLRr[i] — 1 < k then 
1 ^ break; 

else if LLRr[i] > length then length LLRr[i]\ 

/* Calculate all LR^. 
if length > 0 then 

for i — k down to 1 do 

if i + LLRr[i] — 1 < k then 
1 ^ break; 

else if LLRr[i] — length then 
1^ print {i, LLRr[i])\ 
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else print ( — 1 , 0 ); 


*/ 

// Length of LLRi 
*/ 

*/ 

11 Length of 

11 LLRi does not exist or does not cover k. 

II Early stop 


*/ 


// LLRi does not exist or does not cover k. 

// Early stop 

// the start and ending positions of LR^, 
// LRk does not exist. 


Algorithm 4: Sequential finding of all LRk for k = 1,... ,n, using the LLRc array. 

Input: The rank array and the Icp array of the string S 

I* Calculate the compact LLR array. */ 

1 j 1\ prev 0 ; 

2 for i — 1 , 2 , . . . , n do 

L ■<— ina,x{LCP[Rank[i]]j LCP[Rank[i] + 1]}; 
if L > 0 and L > prev then LLRc[j] (z, L); j j -\- 1; 
prev <— L; 


// Length of LLRi 

II Size of the LLRc array. 


6 size ■<— j — 1 ; 

/* Calculate LRi, LR 2 , . . ■ , LRn . ♦/ 

r for k — 1 , 2 , . . . , n do 

start <— 3in.arySea.rch{LLRc, k)l* Return the smallest index of the LLRc array element that covers position k, if 
such element exists; otherwise, return —1. */ 

if start 7 ^ —1 then 

/* Find the length of | LRk |. */ 

length <— 0 ; 

for i — start . . . size do 

if LLRc[i].start LLRc[i].length — l<k then // LLRc[i] does not cover k. 

1^ break; // Early stop 

else if LLRc[i].length > length then length LLRc[i].length; 

*1 


I* Report all LRk- 
for i — start. . . size do 

if LLRc[i].start + LLRc[i].length — 1 < fe then 
1 ^ break; 

else if LLRc[i].length — length then 

print {LLRc[i].start., LLRc[i].length) ; 


else print ( — 1 , 0 ); 


// LLRc[i] does not cover k. 

// Early stop 

// {start, end): start and ending pos of LRk • 
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